Introduction

I have a toxic relationship with RStudio

Hi, I am a student (though not for much longer) who has been struggling with RStudio: loving it, and sometimes hating it.

I often run statistical analyses like t-tests, ANOVAs, and linear regressions, and I love making graphs in RStudio. So I thought it would be a great idea to post my journey here, mostly so that I won’t have to dig through my past files and copy my code whenever I want to do a new analysis.

Sometimes I use RStudio to do funny stuff, like drawing national flags, just for fun. So you will probably see those here.

Feel free to leave comments or suggestions. I am still an infant in the fascinating world of R, and am open to learning new things!

Best, cocoyamo

Reykjavik, 2024

RStudio Environment and Installation

First things first: install and library

I wish someone had used this analogy when I was learning RStudio for the very first time.

People are constantly saying “oh, you need to install this package”, and therefore I used to install every package every single time I used RStudio.

Then people will ask you to “call the package out” using a function called library().

These two things inside RStudio confused me a lot in the beginning (and I was mocked by a TA who was very inconsiderate of people who had never coded before). Hopefully you can find some answers here. And I will not mock you. I swear. I know the struggles.

I now think of this as making a sandwich. Pick whatever sandwich you like. I will do a peanut butter and chocolate sandwich here.

So, imagine you are going to make a peanut butter and chocolate sandwich. You need ingredients, so you go to a nearby grocery store to purchase chocolate spread, peanut butter, and toast. (Feel free to switch to any other food you like, and please do not follow my recipe if you are allergic to peanuts or chocolate!)

This step (getting ingredients from a store) is called install.

Then, you go back to your kitchen and pull these ingredients out of your reusable bag. This action is similar to the library function.

Let’s assume these ingredients will never run out. (What a heaven!) Then, the next time you want a peanut butter and chocolate sandwich, all you need to do is pull out these ingredients again. There is no need to go back to the store every time; you already have everything in your kitchen!

Therefore, whenever you need a package you have not used before, you need to install it first. Afterwards, the only thing you need to do is use the library() function to pull the magical stuff out of your bag.

Okay, I hope you are still with me, and not already in the kitchen looking for your chocolate spread.
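Before we head to the kitchen for real, here is the analogy in code form, using readxl as a stand-in for whichever package you need:

```r
# The trip to the store: download the package onto your computer.
# You only ever need to do this once per package.
install.packages("readxl")

# Pulling it out of the bag: load the package into the current session.
# This part you repeat every time you restart RStudio.
library(readxl)
```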

Data Import and Reading

Let’s import our data!

Import Your Data

note: If this is your first time importing data, and your data comes from an Excel file, then remember to run install.packages("readxl") in your coding space first.

install.packages("readxl")
library(readxl)

There are of course tons of ways to import your data, but the following is what I use most often.

I will click Import Dataset in the top right corner, in the Environment box. Then I choose From Excel (or From Text if I have a .csv file), and use Browse in the top right corner to pick the file I want to import.

You can either double-click the file or click it once and then click Open at the bottom right.

The middle part is the Data Preview, where you can see your data. Don’t import the wrong file! (I’ve done that multiple times, especially when I am lazy about giving files proper names :P)

Then, look at the bottom right corner; there is a section called Code Preview.

The code can look something like this. (But not identical! Because (hopefully) we are using different file names.)

library(readxl)
wether_math <- read_excel("~/Documents/MMR/example files/wether_math.xlsx")
View(wether_math)

Now that you have seen the Code Preview section: at the end of it there is a tiny icon that looks like a clipboard, and you do what? CLICK IT!

Great Job!!

Now that you have copied the code necessary to import your file, press Import at the bottom right.

Go back to the coding section (the top left section of the interface) and paste the code there (Ctrl/Command + V).

(Let’s hope RStudio never changes their interface, or else I will have to rewrite this part.)

Also, sometimes you can use head() to see the first few rows of your data, just to make sure everything went well.

head(wether_math)
## # A tibble: 6 × 3
##      id day   score
##   <dbl> <chr> <dbl>
## 1     1 sunny     9
## 2     2 sunny     8
## 3     3 sunny     7
## 4     4 sunny     8
## 5     5 sunny     8
## 6     6 sunny     8

Or you could just type the dataset’s name.

wether_math
## # A tibble: 20 × 3
##       id day   score
##    <dbl> <chr> <dbl>
##  1     1 sunny     9
##  2     2 sunny     8
##  3     3 sunny     7
##  4     4 sunny     8
##  5     5 sunny     8
##  6     6 sunny     8
##  7     7 sunny     6
##  8     8 sunny     5
##  9     9 sunny     3
## 10    10 sunny     4
## 11     1 rainy     1
## 12     2 rainy     3
## 13     3 rainy     5
## 14     4 rainy     7
## 15     5 rainy     6
## 16     6 rainy     5
## 17     7 rainy     6
## 18     8 rainy     7
## 19     9 rainy     5
## 20    10 rainy     3

Wether Math Data Explanation

Let us take a peek at this data, so you will have a better grasp of it in our future examples.

wether_math
## # A tibble: 20 × 3
##       id day   score
##    <dbl> <chr> <dbl>
##  1     1 sunny     9
##  2     2 sunny     8
##  3     3 sunny     7
##  4     4 sunny     8
##  5     5 sunny     8
##  6     6 sunny     8
##  7     7 sunny     6
##  8     8 sunny     5
##  9     9 sunny     3
## 10    10 sunny     4
## 11     1 rainy     1
## 12     2 rainy     3
## 13     3 rainy     5
## 14     4 rainy     7
## 15     5 rainy     6
## 16     6 rainy     5
## 17     7 rainy     6
## 18     8 rainy     7
## 19     9 rainy     5
## 20    10 rainy     3

This is a made-up dataset. Ten imaginary people participated in this imaginary research: they took a math test on both a rainy day and a sunny day, and their scores were recorded. Therefore, the first column is the participant number, the second column is the weather (either sunny or rainy), and the third column is their score.
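By the way, this dataset is small enough to recreate by hand if you ever lose the Excel file. Here is a base-R sketch (a plain data.frame rather than the tibble the import produced):

```r
# Recreate the made-up weather/math data by hand
wether_math <- data.frame(
  id    = rep(1:10, times = 2),
  day   = rep(c("sunny", "rainy"), each = 10),
  score = c(9, 8, 7, 8, 8, 8, 6, 5, 3, 4,   # sunny-day scores
            1, 3, 5, 7, 6, 5, 6, 7, 5, 3)   # rainy-day scores
)
dim(wether_math)  # 20 rows, 3 columns
```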

Dataset packages

babynames

There are different data packages for R. We can access those data by simply installing the packages. For example, we can install a package called babynames.

This package contains baby names in the US from year 1880 to year 2017.

Let us take a look.

install.packages("babynames")
library(babynames)
babynames
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ℹ 1,924,655 more rows

If we run View(babynames), RStudio opens the data in a separate tab. Either way, we can see that there are 5 columns: year, sex, name, n, and prop.

Wait, what is the prop?

When we encounter questions like this, we can type ? followed by the function or dataset name in the Console. In this case, we type ?babynames.

Once we hit Enter, the Help pane on the right pops up some information. After reading the explanation of this dataset, we learn that prop is n divided by the total number of babies of that sex born in that year.
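The ? trick works for any function or dataset, and help() is the long-hand spelling of the same thing:

```r
?babynames            # help page for the babynames dataset (needs the package installed)
?mean                 # help page for a base function
help("mean")          # exactly the same as ?mean
help.search("anova")  # fuzzy search when you don't know the exact name
```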

starwars

There are also built-in datasets inside the dplyr package. For example, there is a dataset called starwars.

We can import it with a few lines of code.

install.packages("dplyr")
library(dplyr)

We just called out the dplyr package, where the starwars dataset is stored.

starwars
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Ways to Check Data

These functions are commonly used for checking data.

View() will open a new tab for browsing a dataset.

Other commonly used ones are glimpse(), head(), tail(), dim(), and slice().

You can also check the dimensions of a dataset: nrow() and ncol() give the number of rows and columns, and dim() gives both at once.
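Here they are in action on mtcars, a dataset that ships with base R, so you can try them without installing anything. (Note that glimpse() and slice() come from dplyr; mtcars[4:6, ] is the base-R cousin of slice.)

```r
head(mtcars, 3)  # first three rows
tail(mtcars, 3)  # last three rows
dim(mtcars)      # rows and columns at once: 32 11
nrow(mtcars)     # just the rows: 32
ncol(mtcars)     # just the columns: 11
mtcars[4:6, ]    # rows 4 to 6, like dplyr's slice(mtcars, 4:6)
```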

Overview of Tidyverse and Core Functions

I really love tidyverse. It is a game changer. The three packages I use most often in tidyverse are:

  1. ggplot2
  2. dplyr
  3. tidyr

But first we need to know what it is.

Tidyverse contains several packages that make data transformation and visualization (and my life) easier. Under the umbrella of tidyverse, we will be able to wrangle data like a pro.

We will delve deeper into the land of ggplot2 in a future chapter. Here, we focus on dplyr and tidyr first.

If you have not used tidyverse before, you need to install the packages first. (Strictly speaking, installing tidyverse already brings dplyr and tidyr along with it, so the next two lines are optional.)

install.packages("tidyverse")
install.packages("dplyr")
install.packages("tidyr")

Next, let us not forget to also call out our packages, which is the job of the library() function.

library("tidyverse")
library("dplyr")
library("tidyr")

You are all set.

Before we jump into anything, let us meet a funny-looking friend. The pipe.

It looks like this:

%>%

In my humble opinion, this is one of the greatest analogies of this century, although the way it looks (%>%) does not strike me as a real pipe at all.

Let us use an example to see how it works.

babynames

We already met the babynames package in the previous chapter: it contains baby names in the US from 1880 to 2017.

We can take a look at the data.

babynames
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ℹ 1,924,655 more rows


Now, let’s say we want to know names of babies born in year 2000.

babynames %>%
    filter(year == 2000)
## # A tibble: 29,769 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130 
##  2  2000 F     Hannah    23080 0.0116 
##  3  2000 F     Madison   19967 0.0100 
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 29,759 more rows

Here we see all the baby names in 2000!

In the code, I used %>% to direct my dataset babynames into the next function filter().

Inside the filter() function, I told the program I want babies who were born in the year 2000. In other words, I kept only the rows whose ‘year’ column says ‘2000’.

A good thing about %>% is that you can chain one pipe after another, just like you can direct water from east to west, then from west to north.

Let’s try it out.

babynames %>%
    filter(year == 2000) %>%
    filter(sex == "F")
## # A tibble: 17,653 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130 
##  2  2000 F     Hannah    23080 0.0116 
##  3  2000 F     Madison   19967 0.0100 
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 17,643 more rows

Here we put F in quotation marks because it is a character, not a number. We need to specify that so R can understand us.
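You can see the difference with a tiny base-R example (babies here is a made-up stand-in, not the real babynames data):

```r
# numbers are compared without quotes, characters with quotes
babies <- data.frame(year = c(2000, 2000, 1999),
                     sex  = c("F", "M", "F"))
subset(babies, year == 2000)  # a number: no quotes
subset(babies, sex == "F")    # a character: quotes required
```

If you write sex == F without quotes, R reads F as the logical value FALSE, and the filter will not do what you expect.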

Now the data is showing female babies born in the year 2000.

We can also acquire this output in another way.

babynames %>%
 filter(year == 2000 & sex == "F")
## # A tibble: 17,653 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130 
##  2  2000 F     Hannah    23080 0.0116 
##  3  2000 F     Madison   19967 0.0100 
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 17,643 more rows

Let us do more data transformation in the next chapter.

Data Cleaning and Transformation

Here is a dataset of heroes, which I downloaded from https://github.com/dariomalchiodi/superhero-datascience/blob/master/content/data/heroes.csv?plain=1.

library(readr)
setwd("~/Documents/MMR")
heroes <- read_delim("example files/heroes.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)
View(heroes)

Note that the spreadsheet is a .csv file. Therefore, when we import the data, we choose From Text (readr) instead of From Excel.

However, the Data Preview shows that the import is not ideal: everything lands in a single column, with all the information packed into each row.

Here, we click on Delimiter and choose Semicolon, telling RStudio that this dataset separates columns with ; instead of a comma or tab.

We can see from the data that it contains loads of heroes. If I want a neater view of the data I need most, I can use select() to choose which columns to include in my further analysis.

heroes |>
  select(c("Name", "Height", "Weight", "Gender", "Eye color", "Hair color", "Strength", "Intelligence"))
## # A tibble: 735 × 8
##    Name      Height Weight Gender `Eye color` `Hair color` Strength Intelligence
##    <chr>      <dbl>  <dbl> <chr>  <chr>       <chr>           <dbl> <chr>       
##  1 A-Bomb      203.  442.  M      Yellow      No Hair           100 moderate    
##  2 Abraxas      NA    NA   M      Blue        Black             100 high        
##  3 Abominat…   203.  442.  M      Green       No Hair            80 good        
##  4 Adam Mon…    NA    NA   M      Blue        Blond              10 good        
##  5 Agent 13    173.   61.0 F      Blue        Blond              NA <NA>        
##  6 Air-Walk…   189.  108.  M      Blue        White              85 average     
##  7 Agent Bob   178.   81.4 M      Brown       Brown              10 low         
##  8 Abe Sapi…   191.   65.4 M      Blue        No Hair            30 high        
##  9 Abin Sur    186.   90.9 M      Blue        No Hair            90 average     
## 10 Angela       NA    NA   F      <NA>        <NA>              100 high        
## # ℹ 725 more rows

Now, if we want to choose heroes that have a high intelligence level, we can use filter().

This function keeps only the rows that match the given condition.

heroes |>
  filter(Intelligence == "high")
## # A tibble: 144 × 12
##    Name         Identity       `Birth place`      Publisher Height Weight Gender
##    <chr>        <chr>          <chr>              <chr>      <dbl>  <dbl> <chr> 
##  1 Abraxas      Abraxas        Within Eternity    Marvel C…   NA     NA   M     
##  2 Abe Sapien   Abraham Sapien <NA>               Dark Hor…  191.    65.4 M     
##  3 Angela       <NA>           <NA>               Image Co…   NA     NA   F     
##  4 Yoda         Yoda           <NA>               George L…   66.3   17.0 M     
##  5 Zatanna      Zatanna Zatara <NA>               DC Comics  170.    57.8 F     
##  6 Yellowjacket Hank Pym       Elmsford, New York Marvel C…  183.    83.7 M     
##  7 X-Man        Nate Grey      American Northeas… Marvel C…  176.    61.8 M     
##  8 Wonder Woman Diana Prince   Themyscira         DC Comics  183.    74.7 F     
##  9 Watcher      Uatu           <NA>               Marvel C…   NA     NA   M     
## 10 Warlock      Adam Warlock   The Beehive, Shar… Marvel C…  188.   108.  M     
## # ℹ 134 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>

select() and filter() can be stacked together.

Let us look for heroes who are female, have strength of at least 50, and have a high intelligence level.

heroes |>
  select(c("Name", "Gender", "Strength", "Intelligence")) |>
  filter(Gender == "F" & Strength >= 50 & Intelligence == "high")
## # A tibble: 18 × 4
##    Name            Gender Strength Intelligence
##    <chr>           <chr>     <dbl> <chr>       
##  1 Angela          F           100 high        
##  2 Wonder Woman    F           100 high        
##  3 Valkyrie        F            95 high        
##  4 Supergirl       F           100 high        
##  5 Silk Spectre II F            55 high        
##  6 She-Hulk        F           100 high        
##  7 Power Girl      F           100 high        
##  8 Phoenix         F           100 high        
##  9 Lady Bullseye   F            75 high        
## 10 Jean Grey       F            80 high        
## 11 Granny Goodness F           100 high        
## 12 Giganta         F            90 high        
## 13 Faora           F            95 high        
## 14 Donna Troy      F            95 high        
## 15 Cheetah III     F           100 high        
## 16 Cat             F            90 high        
## 17 Captain Marvel  F            90 high        
## 18 Big Barda       F           100 high

Next, suppose we want the same criteria, except that we accept both good and high intelligence levels.

There are two ways of doing it. The first method is to use | as “or”. The second method is to use %in%: if the data fits either of the options, it is kept.

# method 1
heroes |>
  select(c("Name", "Gender", "Strength", "Intelligence")) |>
  filter(Gender == "F" & Strength >= 50) |>
  filter(Intelligence == "high" | Intelligence == "good")
## # A tibble: 47 × 4
##    Name         Gender Strength Intelligence
##    <chr>        <chr>     <dbl> <chr>       
##  1 Angela       F           100 high        
##  2 Wonder Girl  F            90 good        
##  3 Wonder Woman F           100 high        
##  4 Valkyrie     F            95 high        
##  5 Vindicator   F            65 good        
##  6 Thor Girl    F            85 good        
##  7 T-X          F            65 good        
##  8 Supergirl    F           100 high        
##  9 Stargirl     F            80 good        
## 10 Spider-Gwen  F            55 good        
## # ℹ 37 more rows
# method 2
heroes |>
  select(c("Name", "Gender", "Strength", "Intelligence")) |>
  filter(Gender == "F" & Strength >= 50) |>
  filter(Intelligence %in% c("high", "good"))
## # A tibble: 47 × 4
##    Name         Gender Strength Intelligence
##    <chr>        <chr>     <dbl> <chr>       
##  1 Angela       F           100 high        
##  2 Wonder Girl  F            90 good        
##  3 Wonder Woman F           100 high        
##  4 Valkyrie     F            95 high        
##  5 Vindicator   F            65 good        
##  6 Thor Girl    F            85 good        
##  7 T-X          F            65 good        
##  8 Supergirl    F           100 high        
##  9 Stargirl     F            80 good        
## 10 Spider-Gwen  F            55 good        
## # ℹ 37 more rows

As we can see, both methods lead to the same results. I would say the first one is more straightforward; however, I feel cooler when using the second method. :D

If we want to know the heroes’ BMI (body mass index), we can compute the index by dividing their weight (in kilograms) by the square of their height (in meters).

\[ BMI = \frac{weight\ (kg)}{height\ (m)^2} \]

According to this formula, we need to do the following things:

  • add a new column that converts height from centimeters to meters

  • calculate the square of the height in meters

  • divide weight by the square of the heights

heroes |>
  mutate(Height_m = Height / 100) |> # convert heights from cm to m
  mutate(Height_m2 = Height_m*Height_m) |> # square of the heights
  mutate(BMI = Weight / Height_m2) # divide weights by the square of the heights
## # A tibble: 735 × 15
##    Name        Identity             `Birth place` Publisher Height Weight Gender
##    <chr>       <chr>                <chr>         <chr>      <dbl>  <dbl> <chr> 
##  1 A-Bomb      Richard Milhouse Jo… Scarsdale, A… Marvel C…   203.  442.  M     
##  2 Abraxas     Abraxas              Within Etern… Marvel C…    NA    NA   M     
##  3 Abomination Emil Blonsky         Zagreb, Yugo… Marvel C…   203.  442.  M     
##  4 Adam Monroe <NA>                 <NA>          NBC - He…    NA    NA   M     
##  5 Agent 13    Sharon Carter        <NA>          Marvel C…   173.   61.0 F     
##  6 Air-Walker  Gabriel Lan          Xandar, a pl… Marvel C…   189.  108.  M     
##  7 Agent Bob   Bob                  <NA>          Marvel C…   178.   81.4 M     
##  8 Abe Sapien  Abraham Sapien       <NA>          Dark Hor…   191.   65.4 M     
##  9 Abin Sur    <NA>                 Ungara        DC Comics   186.   90.9 M     
## 10 Angela      <NA>                 <NA>          Image Co…    NA    NA   F     
## # ℹ 725 more rows
## # ℹ 8 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>, Height_m <dbl>,
## #   Height_m2 <dbl>, BMI <dbl>

We can always simplify things (or complicate them, depending on your perspective) by mixing the steps into one line.

heroes |>
  mutate(BMI = Weight / (Height/100)^2) 
## # A tibble: 735 × 13
##    Name        Identity             `Birth place` Publisher Height Weight Gender
##    <chr>       <chr>                <chr>         <chr>      <dbl>  <dbl> <chr> 
##  1 A-Bomb      Richard Milhouse Jo… Scarsdale, A… Marvel C…   203.  442.  M     
##  2 Abraxas     Abraxas              Within Etern… Marvel C…    NA    NA   M     
##  3 Abomination Emil Blonsky         Zagreb, Yugo… Marvel C…   203.  442.  M     
##  4 Adam Monroe <NA>                 <NA>          NBC - He…    NA    NA   M     
##  5 Agent 13    Sharon Carter        <NA>          Marvel C…   173.   61.0 F     
##  6 Air-Walker  Gabriel Lan          Xandar, a pl… Marvel C…   189.  108.  M     
##  7 Agent Bob   Bob                  <NA>          Marvel C…   178.   81.4 M     
##  8 Abe Sapien  Abraham Sapien       <NA>          Dark Hor…   191.   65.4 M     
##  9 Abin Sur    <NA>                 Ungara        DC Comics   186.   90.9 M     
## 10 Angela      <NA>                 <NA>          Image Co…    NA    NA   F     
## # ℹ 725 more rows
## # ℹ 6 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>, BMI <dbl>

The results will be the same with either one of the coding methods above. Choose what suits you the best.

There is a fun tip to put the new column on the left, instead of on the right, which is the default. The reason I like this method is that it helps me check the added column more easily, especially when there are lots of columns. By doing so, I don’t have to scroll to the very end of the data to see my newly added column.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .before = 1) 
## # A tibble: 735 × 13
##      BMI Name        Identity       `Birth place` Publisher Height Weight Gender
##    <dbl> <chr>       <chr>          <chr>         <chr>      <dbl>  <dbl> <chr> 
##  1 107.  A-Bomb      Richard Milho… Scarsdale, A… Marvel C…   203.  442.  M     
##  2  NA   Abraxas     Abraxas        Within Etern… Marvel C…    NA    NA   M     
##  3 107.  Abomination Emil Blonsky   Zagreb, Yugo… Marvel C…   203.  442.  M     
##  4  NA   Adam Monroe <NA>           <NA>          NBC - He…    NA    NA   M     
##  5  20.3 Agent 13    Sharon Carter  <NA>          Marvel C…   173.   61.0 F     
##  6  30.4 Air-Walker  Gabriel Lan    Xandar, a pl… Marvel C…   189.  108.  M     
##  7  25.6 Agent Bob   Bob            <NA>          Marvel C…   178.   81.4 M     
##  8  17.9 Abe Sapien  Abraham Sapien <NA>          Dark Hor…   191.   65.4 M     
##  9  26.4 Abin Sur    <NA>           Ungara        DC Comics   186.   90.9 M     
## 10  NA   Angela      <NA>           <NA>          Image Co…    NA    NA   F     
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>

If I prefer Name before BMI, I can use .after = "Name" (a column name) to let RStudio know I want my new column placed to the right of the column I designated.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .after = "Name")
## # A tibble: 735 × 13
##    Name          BMI Identity       `Birth place` Publisher Height Weight Gender
##    <chr>       <dbl> <chr>          <chr>         <chr>      <dbl>  <dbl> <chr> 
##  1 A-Bomb      107.  Richard Milho… Scarsdale, A… Marvel C…   203.  442.  M     
##  2 Abraxas      NA   Abraxas        Within Etern… Marvel C…    NA    NA   M     
##  3 Abomination 107.  Emil Blonsky   Zagreb, Yugo… Marvel C…   203.  442.  M     
##  4 Adam Monroe  NA   <NA>           <NA>          NBC - He…    NA    NA   M     
##  5 Agent 13     20.3 Sharon Carter  <NA>          Marvel C…   173.   61.0 F     
##  6 Air-Walker   30.4 Gabriel Lan    Xandar, a pl… Marvel C…   189.  108.  M     
##  7 Agent Bob    25.6 Bob            <NA>          Marvel C…   178.   81.4 M     
##  8 Abe Sapien   17.9 Abraham Sapien <NA>          Dark Hor…   191.   65.4 M     
##  9 Abin Sur     26.4 <NA>           Ungara        DC Comics   186.   90.9 M     
## 10 Angela       NA   <NA>           <NA>          Image Co…    NA    NA   F     
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>

One last tip before we move on: if we want to keep only the columns used during the mutate() process, instead of using select() and specifying the BMI, Height, and Weight columns, we can use .keep = "used" to tell RStudio that we only want to keep those columns.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .keep = "used") 
## # A tibble: 735 × 3
##    Height Weight   BMI
##     <dbl>  <dbl> <dbl>
##  1   203.  442.  107. 
##  2    NA    NA    NA  
##  3   203.  442.  107. 
##  4    NA    NA    NA  
##  5   173.   61.0  20.3
##  6   189.  108.   30.4
##  7   178.   81.4  25.6
##  8   191.   65.4  17.9
##  9   186.   90.9  26.4
## 10    NA    NA    NA  
## # ℹ 725 more rows

Now, something I experienced in elementary school: after the health examination each year, students would be categorized by their BMI. Then, homeroom teachers would tell those students who were categorized as “overweight”, “obese”, or “underweight” to be careful of their BMIs.

I am not going to discuss what kind of mental damage this caused me (as a child who often got the “overweight” label), but I will use it as an example of how to categorize numerical data into groups (categories) in RStudio.

According to Wikipedia, BMI can be roughly categorized as follows:

BMI                      Category
< 18.5                   underweight (UW)
18.5 - 24.9 (inclusive)  normal weight (NW)
25.0 - 29.9 (inclusive)  overweight (OW)
>= 30                    obese (OB)

What we are going to do now, is to also categorize those heroes according to their BMIs.

We are still using mutate(), since we are adding a new column containing their BMI status.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .after = "Name") |> 
  mutate(BMI_status = case_when(BMI < 18.5 ~ "underweight",
                                BMI >= 18.5 & BMI <= 24.9 ~ "normal_weight",
                                BMI >= 25 & BMI <= 29.9 ~ "overweight",
                                BMI >= 30 ~ "obese"), .after = "BMI") -> heroes2
heroes2 |>
  count(BMI_status)
## # A tibble: 5 × 2
##   BMI_status        n
##   <chr>         <int>
## 1 normal_weight   201
## 2 obese           127
## 3 overweight      114
## 4 underweight      41
## 5 <NA>            252

Here, we can see how many heroes are obese.

We can also see that there is an <NA> row; this happens when the data is incomplete. If we do not want rows with incomplete BMI information in our further analysis, we can remove them by simply adding na.omit() to our code, which also makes further analysis easier.

heroes2 |>
  na.omit() |>
  count(BMI_status) 
## # A tibble: 4 × 2
##   BMI_status        n
##   <chr>         <int>
## 1 normal_weight    48
## 2 obese            40
## 3 overweight       40
## 4 underweight      13

We can also overwrite our heroes2 data with the version without omissions.

heroes2 |>
  na.omit() -> heroes2

If we want further categorization, for instance, to know how many female and male heroes are obese or underweight, we can achieve it by adding another column to the count() function.

heroes2 |>
  count(Gender, BMI_status)
## # A tibble: 8 × 3
##   Gender BMI_status        n
##   <chr>  <chr>         <int>
## 1 F      normal_weight    26
## 2 F      obese             1
## 3 F      overweight        1
## 4 F      underweight      12
## 5 M      normal_weight    22
## 6 M      obese            39
## 7 M      overweight       39
## 8 M      underweight       1

We can also use group_by() function, which will shortly be introduced in future chapters, to achieve the same result.

heroes2 |>
  group_by(Gender) |>
  count(BMI_status)
## # A tibble: 8 × 3
## # Groups:   Gender [2]
##   Gender BMI_status        n
##   <chr>  <chr>         <int>
## 1 F      normal_weight    26
## 2 F      obese             1
## 3 F      overweight        1
## 4 F      underweight      12
## 5 M      normal_weight    22
## 6 M      obese            39
## 7 M      overweight       39
## 8 M      underweight       1

One last thing before moving on to the next chapter: we can actually change the column names in our data without going back to the .csv or .xlsx files!

Let’s say we want to change the column Hair color into Hair_color, and Eye color into Eye_color.

colnames(heroes2)[colnames(heroes2) == "Hair color"] <- "Hair_color"
colnames(heroes2)[colnames(heroes2) == "Eye color"] <- "Eye_color"
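If dplyr is loaded, rename() does the same job in one step, as an alternative to the colnames() approach above (run one or the other, not both). Note the argument order is new_name = old_name, and the old names need backticks because they contain spaces. This sketch assumes the heroes2 data from above:

```r
library(dplyr)

# rename(new_name = old_name); backticks for names with spaces
heroes2 <- heroes2 |>
  rename(Hair_color = `Hair color`,
         Eye_color  = `Eye color`)
```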

Data Wrangling and Reshaping

pivot

This part was a bit abstract when I first learnt it. However, it is very useful when we have wide data but need to do further analysis. R prefers data presented in a long format. For example:

Name Height (cm) Age
Lily 150 20
Mary 160 18
John 156 14
Steve 180 15
Amy 174 16

The above table can go on and on, by adding new names below. However, sometimes we will encounter data that looks like this:

Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3
Kelly 90 95 80 90 80 97
Liam 70 80 94 80 89 79
Henry 79 80 85 96 92 90

This is a made-up dataset displaying different students’ exam performance. They took exams in both English and Math, from Chapter 1 (C1) to Chapter 3 (C3).

In these cases, it is better to convert our wide data into long data. As we move along, we will start doing analyses (t-tests, ANOVAs…), and those are easier when each variable has its own column. For example, in the above table, I would want the subject (English or Math) to be one independent variable, and the scores to be the dependent variable.

Let us create the data first.

performance_original <- data.frame(
   Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
   Eng_C1 = c(90, 70, 79, 98, 81, 79),
   Eng_C2 = c(95, 80, 80, 93, 93, 70),
   Eng_C3 = c(80, 94, 85, 89, 73, 91),
   Math_C1 = c(90, 80, 96, 83, 92, 68),
   Math_C2 = c(80, 89, 92, 72, 84, 83),
   Math_C3 = c(97, 79, 90, 74, 70, 92))

We want to make the data look somewhat like this:

Name Subject Score
Kelly Eng_C1 90
Kelly Eng_C2 95
Kelly Eng_C3 80
Kelly Math_C1 90
Kelly Math_C2 80
Kelly Math_C3 97
Liam Eng_C1 70
Liam Eng_C2 80
Liam Eng_C3 94
Liam Math_C1 80
Liam Math_C2 89
Liam Math_C3 79
Henry Eng_C1 79
Henry Eng_C2 80
… and so on

Here, we will use pivot_longer() to help us.

performance_original |>
  pivot_longer(!Name, names_to = "Subject", values_to = "Score") 
## # A tibble: 36 × 3
##    Name  Subject Score
##    <chr> <chr>   <dbl>
##  1 Kelly Eng_C1     90
##  2 Kelly Eng_C2     95
##  3 Kelly Eng_C3     80
##  4 Kelly Math_C1    90
##  5 Kelly Math_C2    80
##  6 Kelly Math_C3    97
##  7 Liam  Eng_C1     70
##  8 Liam  Eng_C2     80
##  9 Liam  Eng_C3     94
## 10 Liam  Math_C1    80
## # ℹ 26 more rows

We can see that the data format became what we wanted. Let us see what is inside the code chunk. !Name asks R to pivot every column except Name, since we still want the dataset to contain students’ names.

Right after the !Name, we see names_to =. When we are compressing multiple columns into one (like we are doing right now), the name of the new column can be specified with names_to =. On this basis, when we look at values_to =, we can infer that it names the new column holding the numerical data.

However, there is another way to accomplish the same result. It works well when the number of original columns is small, or when you would have to exclude more columns than just Name.

performance_original |>
  pivot_longer(cols = c("Eng_C1", "Eng_C2", "Eng_C3", "Math_C1", "Math_C2", "Math_C3"), 
               names_to = "Subject", values_to = "Score") -> performance

Here, we use cols = to select the columns we want to compress. However, in this example, it is easier to use !Name than to type out every column that needs compressing with cols =.
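
A third option (my own sketch, not from the examples above) is the tidyselect helper starts_with(), which picks columns by their name prefix so we do not have to type each one:

```r
library(tidyr)

# trimmed-down version of the performance data
performance_original <- data.frame(
  Name = c("Kelly", "Liam"),
  Eng_C1 = c(90, 70), Eng_C2 = c(95, 80),
  Math_C1 = c(90, 80), Math_C2 = c(80, 89))

# starts_with() selects every column beginning with "Eng" or "Math"
performance_original |>
  pivot_longer(cols = starts_with(c("Eng", "Math")),
               names_to = "Subject", values_to = "Score")
```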

Now, bear with me when this gets even more abstract.

What if we want to further separate our data by chapters? For instance, we may want to split the Subject column into Subject_em AND Chapter. In this case, we need the separate() function.

performance |>
  separate(
    col = Subject, 
    into = c("Subject_em", "Chapter"),
    sep = "[^[:alnum:]]+",
    remove = TRUE,
    convert = FALSE) -> performance

I know it looks scary, but please hear me out. This would be one of your best friends if you try to understand it. :)

As we mentioned, we want to further separate our Subject column into Subject_em and Chapter. Hence, we use col = to specify which column is our target. Then, we use into = to tell RStudio that we want to separate the column above to columns with these names.

Then, we see a bizarre-looking argument, sep = "[^[:alnum:]]+". We need to break it down to understand it. In short, it means “one or more characters that are neither letters nor numbers”.

  • [:alnum:] means the letters A-Z (a-z) and the numbers 0-9.

  • The ^ right after the opening bracket means “not”: match anything that is not in the set.

  • + means one or more.

Putting the segments together: we want one or more characters that are neither letters nor numbers. In this case, that matches the character _, and it tells R that _ is the separation point.
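
A quick way to convince yourself (a base-R check, my own addition): strsplit() accepts the same pattern, so we can test it directly on a few column names:

```r
# the pattern splits each name at the run of non-alphanumeric characters
strsplit(c("Eng_C1", "Math_C3"), "[^[:alnum:]]+")
```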

Moving on, we see remove = TRUE and convert = FALSE. Both are the defaults in separate(), so we do not have to type them. However, I want to put them here to remind myself.

  • remove = TRUE means to delete the original column Subject in this case.

  • convert = FALSE means that we do not want R to change the column types for us. Sometimes after the separation there are columns with numerical data (for instance, separating Eng_1 into Eng and 1); if we choose convert = TRUE, that column is automatically converted to numeric. However, we can change the column type later if we want to, and we do not need R to do it in this step. Thus, we keep this argument FALSE.
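
For the record (a sketch of a shortcut, not used above): pivot_longer() can do the pivot and the split in a single step with names_sep, so separate() is not always needed:

```r
library(tidyr)

# trimmed-down version of the performance data
performance_original <- data.frame(
  Name = c("Kelly", "Liam"),
  Eng_C1 = c(90, 70), Eng_C2 = c(95, 80),
  Math_C1 = c(90, 80), Math_C2 = c(80, 89))

# names_sep = "_" splits each column name at the underscore,
# filling Subject_em and Chapter in one call
performance_original |>
  pivot_longer(!Name,
               names_to = c("Subject_em", "Chapter"),
               names_sep = "_",
               values_to = "Score")
```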

Congratulations! We have now learned how to pivot wide format data into long format data, and how to separate columns. (After 5 years of learning and using RStudio, I am still stunned by its power. EVERY SINGLE TIME.)

If there is a pivot_longer(), there must be a pivot_wider(). And there is! However, like I said, R prefers long format data for further analysis and graphing, so I will only cover pivot_longer() here, as in my opinion it is more useful for data analysis and data visualization. (Or I will add pivot_wider() once I am more proficient with it, haha.)

join

This is another useful function when we have multiple datasets/files and we want to combine them into one dataset.

Here, we are going to continue using the performance dataset. Let’s say that we also documented their Biology exams in another file.

performance_biology <- data.frame(
   Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
   Bio_C1 = c(80, 80, 83, 93, 80, 88),
   Bio_C2 = c(73, 76, 75, 77, 79, 83),
   Bio_C3 = c(90, 93, 94, 90, 81, 82))

We pull out the original performance data, and want to add these biology columns on it.

performance_original |>
  left_join(performance_biology, by = "Name") 
##    Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3 Bio_C1 Bio_C2 Bio_C3
## 1 Kelly     90     95     80      90      80      97     80     73     90
## 2  Liam     70     80     94      80      89      79     80     76     93
## 3 Henry     79     80     85      96      92      90     83     75     94
## 4 Alice     98     93     89      83      72      74     93     77     90
## 5  Jake     81     93     73      92      84      70     80     79     81
## 6   Dan     79     70     91      68      83      92     88     83     82

Here, we can see that we have successfully added the biology columns on the original dataset. Of course, when we have multiple independent variables (columns), we can use by = c("column1", "column2", …) to specify the left join basis.

Yes, you guessed right: there is also a right_join(). Although it may not be used as often as left_join() (in my usual data analysis), I still sometimes reach for it. So let us look at the differences between left_join() and right_join().

left_join(): keeps data on the left, even if there is no matching data on the right.

right_join(): keeps data on the right, even if there is no matching data on the left.

If a row has no match, NA is filled in. It is as simple as that, but still extremely powerful (and magical).
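
A tiny sketch of that NA behaviour, using made-up mini tables (assuming dplyr is loaded):

```r
library(dplyr)

english <- data.frame(Name = c("Kelly", "Liam"), Eng = c(90, 70))
biology <- data.frame(Name = c("Kelly", "Zoe"),  Bio = c(80, 95))

left_join(english, biology, by = "Name")   # Liam is kept, his Bio is NA
right_join(english, biology, by = "Name")  # Zoe is kept, her Eng is NA
```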

bind

Besides left and right join, there is another way to combine different datasets into one.

dataset1 <- data.frame(
  brand = c("1","2","3"),
  price = c(200, 300, 250),
  profit = c(100, 200, 200))
dataset2 <- data.frame(
  brand = c("1","2","3"),
  sales = c(20, 40, 60))

We have created two separate datasets. One contains each product’s brand, price, and profit. The other contains the sales numbers.

cbind(dataset1, dataset2) -> merged_data
merged_data
##   brand price profit brand sales
## 1     1   200    100     1    20
## 2     2   300    200     2    40
## 3     3   250    200     3    60

Through cbind (column bind), we can merge the two datasets.

However, we can also see that it repeated the brand column in the merged data. If this does not affect your future analysis, you can leave it. Or you can simply use left_join() (as mentioned above) to do the binding without creating the extra column.

dataset1 |>
  left_join(dataset2, by = "brand")
##   brand price profit sales
## 1     1   200    100    20
## 2     2   300    200    40
## 3     3   250    200    60
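
One more sibling worth knowing (my own addition, not shown above): rbind (row bind), which stacks datasets that share the same columns, for example the same measurements collected in two batches:

```r
# rbind stacks rows; the two data frames must have identical column names
batch1 <- data.frame(brand = c("1", "2"), sales = c(20, 40))
batch2 <- data.frame(brand = c("3", "4"), sales = c(60, 35))

rbind(batch1, batch2)
```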

Statistical Analysis

t-test

Let us look at the strength difference from different publishers.

heroes2 |>
  group_by(Publisher) |>
  summarize(mean_strength = mean(Strength, na.rm = TRUE))
## # A tibble: 4 × 2
##   Publisher     mean_strength
##   <chr>                 <dbl>
## 1 DC Comics              64.6
## 2 George Lucas           33.3
## 3 Image Comics           60  
## 4 Marvel Comics          43.1

We can see from the result that the mean strength differs between publishers. However, is the difference significant?

Let us look at DC Comics vs Marvel Comics.

heroes2 |>
  filter(Publisher %in% c("Marvel Comics", "DC Comics")) |>
  group_by(Publisher) |>
  summarize(Strengths = list(Strength)) |>
  with(t.test(Strengths[[1]], Strengths[[2]], paired = FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  Strengths[[1]] and Strengths[[2]]
## t = 2.8807, df = 32.441, p-value = 0.00698
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.297946 36.652054
## sample estimates:
## mean of x mean of y 
##    64.600    43.125

That looks complicated. I do not normally use that.

I first subset different groups of data.

subset(heroes2, Publisher == "Marvel Comics") -> marvel
subset(heroes2, Publisher == "DC Comics") -> DC

Then, I calculate the difference between these two groups of data using t.test().

t.test(marvel$Strength, DC$Strength, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  marvel$Strength and DC$Strength
## t = -2.8807, df = 32.441, p-value = 0.00698
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -36.652054  -6.297946
## sample estimates:
## mean of x mean of y 
##    43.125    64.600

We can see from the result that there is a significant difference in strength level between heroes from DC Comics and heroes from Marvel Comics (p = 0.007): on average, the DC heroes are stronger.

Here, the heroes from the DC universe are different individuals from the heroes in the Marvel universe. Thus, we cannot use a paired t-test.

However, when the same group of people does different tasks (or the same task repeatedly), we can use a paired t-test.

I hope you still remember the performance data. Let us say we want to look at if there is any difference between exam scores of Chapter 1 and Chapter 3.

# group them first
subset(performance, Subject_em == "Eng" & Chapter == "C1") -> eng_c1
subset(performance, Subject_em == "Eng" & Chapter == "C3") -> eng_c3

t.test(eng_c1$Score, eng_c3$Score, paired = TRUE)
## 
##  Paired t-test
## 
## data:  eng_c1$Score and eng_c3$Score
## t = -0.44114, df = 5, p-value = 0.6775
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -17.06789  12.06789
## sample estimates:
## mean difference 
##            -2.5
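
A small aside (my own sketch, not part of the workflow above): t.test() also has a formula interface, DV ~ group, which avoids creating subsets first when the grouping lives in one column:

```r
# made-up data: one value column, one grouping column
df <- data.frame(group = rep(c("A", "B"), each = 4),
                 value = c(5, 6, 7, 8, 9, 10, 11, 12))

# same Welch two-sample t-test, no subset() needed
t.test(value ~ group, data = df)
```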

ANOVA

Here is a template for anova_test().

library(rstatix)
heroes2 |>
  anova_test(
    dv = # dependent variable, 
    wid = # subject name/id, 
    type = NULL, # default: type II ANOVA
    between = # between-subject independent variable,
    within = # within-subject independent variable,
    effect.size = "ges", 
    error = NULL,
    white.adjust = FALSE,
    observed = NULL,
    detailed = TRUE
  )

  • detailed = F is the default. However, I would like my results a bit more informative. Thus, I normally change it to detailed = T.

  • type = NULL is the default, which uses a type II ANOVA. To change it, switch NULL to 1 or 3.

  • You can add covariate = below within =, above type = , for ANCOVA.

  • effect.size = is the effect size to compute.

    • pes (partial eta squared)

    • ges (generalized eta squared) Default

    • The Help section in RStudio says I could choose both, but I haven’t figured out how to do so yet.

one-way ANOVA

example 1: strength ~ intelligence

IV: intelligence

DV: strength

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Intelligence,
    detailed = T 
  )
## ANOVA Table (type II tests)
## 
##         Effect      SSn    SSd DFn DFd     F     p p<.05   ges
## 1 Intelligence 8758.544 129262   4 136 2.304 0.062       0.063

example 2: strength ~ gender

IV: gender

DV: strength

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Gender,
    detailed = T 
  )
## ANOVA Table (type II tests)
## 
##   Effect      SSn      SSd DFn DFd     F    p p<.05   ges
## 1 Gender 3772.182 134248.4   1 139 3.906 0.05       0.027

two-way ANOVA

example 1: strength ~ BMI_status * gender
heroes2 |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, BMI_status), 
    detailed = TRUE
  ) 
## ANOVA Table (type III tests)
## 
##              Effect       SSn      SSd DFn DFd      F        p p<.05      ges
## 1       (Intercept) 68473.044 109803.3   1 133 82.938 1.11e-15     * 0.384000
## 2            Gender    51.278 109803.3   1 133  0.062 8.04e-01       0.000467
## 3        BMI_status  8413.707 109803.3   3 133  3.397 2.00e-02     * 0.071000
## 4 Gender:BMI_status  6885.839 109803.3   3 133  2.780 4.40e-02     * 0.059000

Be careful with the ANOVA type. Even though I did not specify which type of ANOVA to use, the system chose type III for me. Add type = 2 if you would like to stick with a type II ANOVA.

example 2: strength ~ gender * BMI
heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, BMI), 
    detailed = T 
  )

However, we would encounter an error here. Why is that?

Step 1

We check whether the two independent variables are of different types, for example one categorical and one numerical. This happens if we put Gender (category) and BMI (numbers) in the between = section.

Solutions:

  1. Change the numerical column into a categorical one. For example, categorize BMI into levels like obese, overweight, normal, and underweight. (We did this in a previous section, so I will just put the end code here.)

  2. Use ANCOVA (continuous * category ~ continuous) instead of ANOVA: simply put the numerical variable in the covariate = section. I will not go further into this topic here, as we are in the ANOVA section and I am not familiar with ANCOVA. I will come back when I use it more often.

# solution 1
heroes2 |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, BMI_status), 
    detailed = T 
  )   
## ANOVA Table (type III tests)
## 
##              Effect       SSn      SSd DFn DFd      F        p p<.05      ges
## 1       (Intercept) 68473.044 109803.3   1 133 82.938 1.11e-15     * 0.384000
## 2            Gender    51.278 109803.3   1 133  0.062 8.04e-01       0.000467
## 3        BMI_status  8413.707 109803.3   3 133  3.397 2.00e-02     * 0.071000
## 4 Gender:BMI_status  6885.839 109803.3   3 133  2.780 4.40e-02     * 0.059000
# solution 2 (ANCOVA)
heroes2 |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = Gender, 
    covariate = BMI,
    detailed = T 
  )   
## ANOVA Table (type II tests)
## 
##   Effect       SSn      SSd DFn DFd      F        p p<.05      ges
## 1    BMI 23336.275 110912.1   1 138 29.036 2.99e-07     * 1.74e-01
## 2 Gender    10.606 110912.1   1 138  0.013 9.09e-01       9.56e-05

example 3: strength ~ gender * Intelligence

Let us take a look at another example.

We know that in the heroes2 data, Intelligence is a categorical variable. Heroes’ intelligence levels are labelled as high, good, average, moderate, and low.

heroes2 |>
  distinct(Intelligence)
## # A tibble: 5 × 1
##   Intelligence
##   <chr>       
## 1 moderate    
## 2 good        
## 3 high        
## 4 average     
## 5 low

Therefore, we should not encounter the same problem as in example 2.

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, Intelligence), 
    detailed = T)

However, we still see errors here. If we have confirmed that Step 1 is not the issue, we can try to debug in the following ways.

Step 2

We examine both independent variables separately to make sure each works fine individually. Just delete one of the variables at a time to check. (Usually this is not the problem, but just in case, I still check it.)

  • checking variable Gender
heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Gender, 
    detailed = T)
## ANOVA Table (type II tests)
## 
##   Effect      SSn      SSd DFn DFd     F    p p<.05   ges
## 1 Gender 3772.182 134248.4   1 139 3.906 0.05       0.027

  • checking variable Intelligence
heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Intelligence, 
    detailed = T)
## ANOVA Table (type II tests)
## 
##         Effect      SSn    SSd DFn DFd     F     p p<.05   ges
## 1 Intelligence 8758.544 129262   4 136 2.304 0.062       0.063

Both variables look fine and the code works smoothly individually.

Step 3

We examine whether any of the categories has missing values. (I often forget this step and then spend hours confused about why my ANOVA does not work…)

table(heroes2$Gender, heroes2$Intelligence)
##    
##     average good high low moderate
##   F      14   20    4   0        2
##   M      24   41   27   2        7

Here, we can see that for female heroes with a low intelligence level, the count is zero. This makes the ANOVA unable to proceed. Now that we have found the problem, we can simply remove Intelligence == "low" from our analysis, since it hinders it.

heroes2 |>
  filter(Intelligence != "low") |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, Intelligence), 
    type = 2,
    detailed = T)   
## ANOVA Table (type II tests)
## 
##                Effect      SSn      SSd DFn DFd     F     p p<.05   ges
## 1              Gender 1982.076 126823.2   1 131 2.047 0.155       0.015
## 2        Intelligence 3719.933 126823.2   3 131 1.281 0.284       0.028
## 3 Gender:Intelligence  406.757 126823.2   3 131 0.140 0.936       0.003

Maybe you would think: I’ll just add group_by() before the anova_test(). However, while this would run, it would not be the two-way ANOVA we wanted. Instead, it would just separate the data into two groups before performing a one-way ANOVA on each. See the code below:

heroes2 |>
  filter(Intelligence != "low") |>
  group_by(Gender) |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = Intelligence, 
    detailed = T 
  )   
## # A tibble: 2 × 10
##   Gender Effect         SSn    SSd   DFn   DFd     F     p `p<.05`   ges
## * <chr>  <chr>        <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   <dbl>
## 1 F      Intelligence 1564. 33535.     3    36  0.56 0.645 ""      0.045
## 2 M      Intelligence 2563. 93288.     3    95  0.87 0.46  ""      0.027

We can see that this result table is different from the two-way ANOVA table above, and the numbers differ as well. (Compare DFn and DFd in both results.)

three-way ANOVA

example 1: strength ~ gender * intelligence * BMI_status

If we want to add more independent variables, we could just add more column names in the between = section.

heroes2 |>
  filter(Intelligence != "low") |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, Intelligence, BMI_status), 
    type = 2,
    detailed = T 
  )   
## ANOVA Table (type II tests)
## 
##                           Effect       SSn      SSd DFn DFd     F        p
## 1                         Gender  1083.194 96626.99   1 118 1.323 0.252000
## 2                   Intelligence  3465.586 96626.99   3 118 1.411 0.243000
## 3                     BMI_status 16524.941 96626.99   3 118 6.727 0.000314
## 4            Gender:Intelligence  2703.193 96626.99   2 118 1.651 0.196000
## 5              Gender:BMI_status  8295.111 96626.99   3 118 3.377 0.021000
## 6        Intelligence:BMI_status  7715.048 96626.99   7 118 1.346 0.235000
## 7 Gender:Intelligence:BMI_status        NA 96626.99   0 118    NA       NA
##   p<.05   ges
## 1       0.011
## 2       0.035
## 3     * 0.146
## 4       0.027
## 5     * 0.079
## 6       0.074
## 7  <NA>    NA

Note that if an independent variable is within subject (making this a mixed analysis), we put it in the within = section instead. This is often used when analyzing experiment results, for example comparing test scores before and after learning. See below for more about the mixed ANOVA.

mixed ANOVA

As mentioned, a mixed ANOVA is more often used when analyzing experiment results, as it would be difficult to capture individual changes over time with other methods.

Let us assume a high school wants to know whether different teaching styles affect students’ exam scores. They ran an experiment on students, separating them into two groups: one group was taught in a fun way, the other in a rigid way. The school also gave exams before and after the students learned the subject.

  • Independent variable: teaching style (fun/rigid), exam time (before/after)

  • Dependent variable: exam score

teaching_style <- data.frame(
  student = c("1", "2", "3", "4", "5", "6", 
              "1", "2", "3", "4", "5", "6"),
  style = as.factor(c("fun", "rigid", "fun", "rigid", "fun", "rigid",
                      "fun", "rigid", "fun", "rigid", "fun", "rigid")),
  timing = as.factor(c("before", "before", "before", "before", "before", "before", 
                       "after", "after", "after", "after", "after", "after")),
  score = c(50, 13, 10, 70, 23, 70, 70, 90, 66, 83, 80, 80)
  )

The data frame is created. There are 6 students in total: 3 attended the fun-style teaching, and 3 experienced the rigid style. All of their before and after exam scores are also included.

Here, we want to see if the teaching style made a difference in their scores.

teaching_style |>
   anova_test(
    dv = score, 
    wid = student, 
    between = style, 
    within = timing,
    detailed = T)   
## ANOVA Table (type II tests)
## 
##         Effect DFn DFd       SSn      SSd       F       p p<.05   ges
## 1  (Intercept)   1   4 41418.750 1278.667 129.569 0.00034     * 0.929
## 2        style   1   4   954.083 1278.667   2.985 0.15900       0.232
## 3       timing   1   4  4524.083 1876.667   9.643 0.03600     * 0.589
## 4 style:timing   1   4    90.750 1876.667   0.193 0.68300       0.028

From the analysis, we can see that style does not have a significant impact on the scores. However, there is a main effect of timing.

By looking at the mean scores at each timing, we can see that the students performed better after learning, no matter which teaching style was used.

teaching_style |>
  group_by(timing) |>
  summarise(mean_score = mean(score))
## # A tibble: 2 × 2
##   timing mean_score
##   <fct>       <dbl>
## 1 after        78.2
## 2 before       39.3

Additional notes

  • add filter() to choose specific groups you want in your analysis, for instance filter(Intelligence %in% c("high", "good")) or filter(Hair_color != "No Hair")
  • group_by() can still come in handy when you want to see each condition’s ANOVA result separately.

Post hoc analysis

After an ANOVA, if we want to dig deeper into the data (and if the ANOVA had a significant result), we can use a post hoc analysis.

I used to get confused about post hoc analysis versus pairwise t-tests. Note that a post hoc analysis comes after we have retrieved a significant result from an ANOVA, while a pairwise t-test can be applied to any two groups of data.

teaching_style |>
 group_by(style) |>
  pairwise_t_test(
    score ~ timing,
    paired = TRUE,
    p.adjust.method = "bonferroni"
  )
## # A tibble: 2 × 11
##   style .y.   group1 group2    n1    n2 statistic    df     p p.adj p.adj.signif
## * <fct> <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <dbl> <chr>       
## 1 fun   score after  before     3     3      3.64     2 0.068 0.068 ns          
## 2 rigid score after  before     3     3      1.53     2 0.267 0.267 ns

Here, we used pairwise_t_test() from rstatix as our post hoc analysis. Let us look at the arguments:

  • score ~ timing: we put the dependent variable on the left, the independent variable on the right. Here, we compare the scores under different exam timing.

  • paired =: if it is a repeated measure, then TRUE; if not (implemented on different participants, aka between-subject), then FALSE.

  • p.adjust.method: allows us to choose the method for adjusting p values.
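
As a side note (my own addition, and ignoring the repeated-measures structure for simplicity): Tukey’s HSD is another classic post hoc test, available in base R through aov() + TukeyHSD():

```r
# rebuild the teaching_style data from the mixed ANOVA section
teaching_style <- data.frame(
  student = rep(as.character(1:6), 2),
  style = factor(rep(c("fun", "rigid"), 6)),
  timing = factor(rep(c("before", "after"), each = 6)),
  score = c(50, 13, 10, 70, 23, 70, 70, 90, 66, 83, 80, 80))

# Tukey's HSD compares every pair of group levels with adjusted p values
fit <- aov(score ~ timing, data = teaching_style)
TukeyHSD(fit)
```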

Regression analysis

Data Visualization

bar chart

We will take the heroes2 dataset as an example again.

heroes2 |>
  ggplot(aes(x = Gender, y = Strength)) +
  geom_bar(stat = "summary", position = position_dodge(.8), width = .7, fill = "#BDD5EA") +
  geom_errorbar(stat = "summary", position = position_dodge(0.5), width = .12) +
  facet_wrap(Intelligence ~ ., scales = "free") 

A lot is happening in that code chunk, and there is a lot of information in the graphs.

Let us start with the code.

  • in ggplot(), we specify the x axis and y axis.

  • geom_bar() creates bar chart.

    • stat = "summary" helps you calculate the mean value of the y-axis, Strength in this case.

    • position = position_dodge() places bars of different groups side by side instead of stacking or overlapping them.

    • width = is the width of the bar. When the width = 1, the bars would stick together.

    • fill = can either be a single color (like in this example), or could be other variables.

  • geom_errorbar() with stat = "summary" computes the standard error of the values and presents it on the graph.

    • the width, fill, and position arguments work the same way as in geom_bar()

  • facet_wrap() helps you separate the graphs according to the variable you put in.

    • You can either do <your variable> ~. or ~ <your variable>.

    • scales = "free" lets each panel’s y-axis adjust to its own maximum value. However, it is not recommended in some situations, as we will explain below.

Then, let’s take a look at the graph. We can see a few things:

  1. The graphs were separated by different intelligence levels.
  2. In the low intelligence group, there is only male heroes’ strength data, indicating there are no female heroes in the low intelligence group.
  3. The y-axis scale differs from one graph to another. Although it is nice that the program adapts the scale to our data, it can hide vital information: for example, the good intelligence group and the high intelligence group are hard to compare because their y-axes differ. We will explain how to adjust this in the scales section.

If we do not want to use facet_wrap() on the graph, we can use other ways to distinguish the categories. For example, if we give fill = a variable, R will fill each category with a different color.

heroes2 |>
  ggplot(aes(x = Gender, y = Strength, fill = Intelligence)) +
  geom_bar(stat = "summary", position = position_dodge(.8), width = .7) +
  geom_errorbar(stat = "summary", position = position_dodge(.8), width = .12) 

scatter plot

heroes2 |>
  ggplot(aes(x = Height, y = Strength, color = Gender, shape = Gender)) +
  geom_point() +
  geom_smooth(method = "lm")

We can see from the plot that this is probably not the best data to demonstrate a scatter plot, since the outliers clearly affect the aesthetics. Let us try to remove the male hero on the very right side of the graph. Since we are removing a hero (a row), we need filter() to help us out.

heroes2 |>
  filter(Height < 250) |>
  ggplot(aes(x = Height, y = Strength, color = Gender, shape = Gender)) +
  geom_point() +
  geom_smooth(method = "lm") 

Now that we have a clearer scatter plot, let us take a look inside the code.

In the aes() section, you can see that I map both the color AND the shape to Gender. Usually one kind of specification works; however, sometimes the colors are not easily distinguishable. Therefore, it is useful to add another feature to make your graph more readable.

  • geom_point() draws each data point as a point on the graph, thus creating the scatter plot.

  • geom_smooth() adds a trend line to the data. Using method = "lm" means we use a linear model, which keeps the line straight.
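
A quick sketch of the alternative (using the built-in mtcars data as a convenient stand-in, my own choice): dropping method = "lm" in favour of method = "loess" lets the trend line curve to follow the data:

```r
library(ggplot2)

# loess fits a local, flexible trend line instead of a straight one
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "loess")
```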

I am a bit tired of the data we have been using so far, so let us find a new fun dataset.

This dataset is inside an R package called openintro. It contains lots of datasets. We are using one of them called movies.

install.packages("openintro")
library("openintro")
movies
## # A tibble: 140 × 5
##    movie           genre      score rating box_office
##    <chr>           <chr>      <dbl> <chr>       <dbl>
##  1 2Fast2Furious   action      48.9 PG-13      127.  
##  2 28DaysLater     horror      78.2 R           45.1 
##  3 AGuyThing       rom-comedy  39.5 PG-13       15.5 
##  4 AManApart       action      42.9 R           26.2 
##  5 AMightyWind     comedy      79.9 PG-13       17.8 
##  6 AgentCodyBanks  action      57.9 PG          47.8 
##  7 Alex&Emma       rom-comedy  35.1 PG-13       14.2 
##  8 AmericanWedding comedy      50.7 R          104.  
##  9 AngerManagement comedy      62.6 PG-13      134.  
## 10 AnythingElse    rom-comedy  63.3 R            3.21
## # ℹ 130 more rows

In this dataset, there are 5 columns and 140 rows. They are movies released in 2003. The data contains the movie title, genre, score (by critics on a scale 0-100), MPAA rating, and box_office (millions of dollars earned at the box office in the US and Canada.)

You can look up movies in the Help section on the bottom-right panel to see a more detailed description of the data.

movies |> 
  ggplot(aes(x = box_office, y = score)) + geom_point() +
  geom_smooth(method = "lm")

We can see from the graph, that there seems to be a positive relationship between the box_office and score.

box plot

A boxplot helps us see the interquartile range of our data. It is useful when you are comparing a categorical variable against a numerical one.

movies |>
  ggplot(aes(x = score, y = rating)) +
  geom_boxplot() 

histogram

movies |>
  ggplot(aes(x = score)) +
  geom_histogram()

movies |>
  ggplot(aes(x = score)) +
  geom_histogram(binwidth = 1)

movies |>
  ggplot(aes(x = score)) +
  geom_histogram(binwidth = 10)

  • binwidth controls the width of the bars. If it is very small (binwidth = 1), you can see each data point clearly; however, that is not ideal when we want to see the general trend of the data. Therefore, I recommend trying different binwidth values until you find the one that suits your data best.

movies |>
  ggplot(aes(x = score, color = rating)) +
  geom_histogram(binwidth = 5)

We can see that color controls the outline color of the bars, while fill controls the bars themselves.

movies |>
  ggplot(aes(x = score, fill = rating)) +
  geom_histogram(binwidth = 5)

density plot

A density plot looks like what we would get by connecting the peaks of a histogram, creating a smoother line for the data.

movies |>
  ggplot(aes(x = score)) +
  geom_density()

We can also separate the data into subcategories.

movies |>
  ggplot(aes(x = score, fill = rating, color = rating)) +
  geom_density()

However, this graph is a bit difficult to read. One reason is that the data overlap: we cannot clearly see the data underneath the R and PG-13 ratings. Therefore, we are going to change the transparency of our graph.

movies |>
  ggplot(aes(x = score, fill = rating, color = rating)) +
  geom_density(alpha = 0.5)

We can also pull the groups further apart by using geom_density_ridges() from the ggridges package.

install.packages("ggridges")
library(ggridges)

movies |>
  ggplot(aes(x = score,y = rating,  fill = rating, color = rating)) +
  geom_density_ridges(alpha = 0.6)

mixed graphs

We are creating a grid of plots that shows several pairwise relationships at once.

We will be using another package: GGally.

install.packages("GGally")
library(GGally)

categorical * numeric

Let us say we want to see if there is any relationship between the lego price and its size.

lego_sample |>
  ggpairs(columns = c("price", "size"))

Let us take a deeper look at the code and the results.

The price in the dataset is a continuous numeric variable, while size is a categorical variable.

In the result, we can see the top-left panel is the distribution of price. It is a univariate plot, as it is price vs price if you read off the row and column. The same applies to the bottom-right panel: since size is a categorical variable, ggpairs() counts the number of lego sets of each size.

The top-right panel is price vs size. When it comes to a numeric variable vs a categorical variable, a boxplot can be informative, so ggpairs() draws one for us. The bottom-left panel conveys the same idea as the top-right, as both are price vs size: it separates the sizes (large and small) and plots the distribution of prices in each. Sadly, the graph does not tell us which plot belongs to the large size and which to the small.

numeric * numeric

Let us take a look at other variables.

In the dataset, there is a column called price, recording the recommended retail price of each lego set. There is also a column called amazon_price; you can guess from the name that this is the lego price on Amazon.

lego_sample |>
  ggpairs(columns = c("price", "amazon_price"))

This graph is different from our examples above. The reason is that the two variables here (price and amazon_price) are both numerical.

The top-left panel therefore contains the density plot of the recommended retail price, while the bottom-right panel shows the density of amazon_price. We can see that the two distributions are roughly the same, though not identical.

On the bottom-left panel, there is a scatter plot of the data. It shows a relatively strong positive correlation between the two variables, which matches the top-right panel, where the correlation coefficient is displayed.

numeric * numeric * numeric

Let us see what happens if we put only numeric variables inside. Earlier, we turned the variable minifigures into a factor instead of keeping its original numeric form. However, we need numeric variables right now, so we do not wrap it in as.factor() here.

lego_sample |>
  ggpairs(columns = c("price", "amazon_price", "minifigures"))

We can see this is a good way to grasp a general idea of the relationships in our data.

What if we want some extra information on the graph? For example, we may want a trend line on the scatter plots, with the error band around it.

lego_sample |>
  ggpairs(columns = c("price", "amazon_price", "minifigures"),
          lower = list(continuous = wrap("smooth")))

Here, we simply add lower = and specify what we want.

  • lower = refers to the panels below the diagonal.

  • continuous =: we are telling ggpairs() that our data are continuous numeric variables; we put wrap() to specify what to do with them.

  • smooth: puts geom_smooth() on the panels below the diagonal.

Note that the correlation method used here is the default "pearson". If you would like to change the method, you can add an extra argument.

lego_sample |>
  ggpairs(columns = c("price", "amazon_price", "minifigures"),
          upper = list(continuous = wrap("cor", method = "spearman")),
          lower = list(continuous = wrap("smooth")))

We now see an upper = argument here. It is a bit different from the lower = part.

  • cor: we want our upper panels to show correlation coefficients. If we do not want the default pearson method, we can use method = to specify another correlation method.

numeric * numeric * numeric * categorical

Now you have learned a few ways to deal with plots inside ggpairs(). Let us add some color to it (literally).

If we want to put a categorical variable here, a good way is to separate different groups by colors. We can specify that on our aes() layer.

lego_sample |>
  ggpairs(columns = c("price", "amazon_price", "minifigures", "size"),
          mapping = aes(color = size, alpha = .07),
          upper = list(continuous = wrap("cor", method = "spearman")),
          lower = list(continuous = wrap("smooth")))

Here, as you can see, I also added transparency with alpha =, as I don't want the plot to be too difficult to read because of the overlapping graphs.

Below is another coding style that achieves the same result. I personally prefer this one because it separates selecting the columns I want from the plotting section.

lego_sample |>
  select(price, amazon_price, minifigures, size) |>
  ggpairs(mapping = aes(color = size, alpha = .07),
          upper = list(continuous = wrap("cor", method = "spearman")),
          lower = list(continuous = wrap("smooth")))

Although ggpairs() comes in handy when you want to look at relationships between different variables (e.g., whether scores from different tests are related), all the panels can also make you dizzy. If you are unsure whether this is the most helpful way to present your data, I would recommend making the plots separately first (a density plot or a box plot, for example) before you mix everything together. It will help you figure out what you want to do with your data, and how to do it in the best way possible.

additional features on the plot

labels

A good graph contains informative labels. Of course, you could add the title, subtitle, and other labels after exporting the graph to other software. However, it is more efficient, and causes fewer mistakes, to finish all the labelling at once in R.

movies |>
  ggplot(aes(x = score, y = rating, fill = rating, color = rating)) +
  geom_density_ridges(alpha = 0.6) +
  labs(title = "Movie Scores of Different MPAA Ratings.",
       subtitle = "Put your subtitle here because I don't know what to say.",
       x = "Movie Score",
       y = "MPAA Rating",
       fill = "Rating",
       color = "Rating",
       caption = "sources: movies from openintro")

If we do not want to present a label, we can also specify that and let ggplot remove it for us.

movies |>
  ggplot(aes(x = score, y = rating, fill = rating, color = rating)) +
  geom_density_ridges(alpha = 0.6) +
  labs(x = NULL,
       fill = NULL,
       color = NULL)

We can see from the code chunk above that when we put NULL for the x-axis, that label does not appear in the graph. However, since we did not include the y-axis in labs(), the y-axis still shows the variable name from the original data.

Also, if you would like to remove the legend as well, you could do that in theme().

movies |>
  ggplot(aes(x = score, y = rating, fill = rating, color = rating)) +
  geom_density_ridges(alpha = 0.6) +
  labs(x = NULL, y = NULL) + 
  theme(legend.position = "none")

I haven't figured out how to change the label text itself yet. Headache. Sigh. It can be done with mutate(), but that feels like a hassle.
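For the record, the mutate() route for renaming the labels can be sketched like this. The replacement label texts ("General", "Restricted", and so on) are just my own examples, not anything official:

```r
library(dplyr)
library(ggplot2)
library(ggridges)
library(openintro)

# Recode the rating values before plotting, so the new text shows up
# on the axis and in the legend automatically.
movies |>
  mutate(rating = recode(rating,
                         "G"     = "General",
                         "PG"    = "Parental Guidance",
                         "PG-13" = "Parents Cautioned",
                         "R"     = "Restricted")) |>
  ggplot(aes(x = score, y = rating, fill = rating, color = rating)) +
  geom_density_ridges(alpha = 0.6)
```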

colors

Let us demonstrate how to change the colors in a graph with another dataset!

In the dataset lego_sample, there are different lego sets, with their pieces, recommended retail prices, prices on Amazon, sizes, and so on.

lego_sample
## # A tibble: 75 × 14
##    item_number set_name       theme pieces  price amazon_price  year ages  pages
##          <dbl> <chr>          <chr>  <dbl>  <dbl>        <dbl> <dbl> <chr> <dbl>
##  1       10859 My First Lady… DUPL…      6   4.99        16     2018 Ages…     9
##  2       10860 My First Race… DUPL…      6   4.99         9.45  2018 Ages…     9
##  3       10862 My First Cele… DUPL…     41  15.0         39.9   2018 Ages…     9
##  4       10864 Large Playgro… DUPL…     71  50.0         56.7   2018 Ages…    32
##  5       10867 Farmers' Mark… DUPL…     26  20.0         37.0   2018 Ages…     9
##  6       10870 Farm Animals   DUPL…     16   9.99         9.99  2018 Ages…     8
##  7       10872 Train Bridge … DUPL…     26  25.0         22.0   2018 Ages…    16
##  8       10875 Cargo Train    DUPL…    105 120.         129.    2018 Ages…    64
##  9       10876 Spider-Man & … DUPL…     38  30.0         74.5   2018 Ages…    20
## 10       10878 Rapunzel's To… DUPL…     37  30.0         99.0   2018 Ages…    24
## # ℹ 65 more rows
## # ℹ 5 more variables: minifigures <dbl>, packaging <chr>, weight <chr>,
## #   unique_pieces <dbl>, size <chr>
lego_sample |>
  na.omit() |>
  ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
  geom_bar(stat = "summary", position = "dodge")

For easier demonstration, we are going to look only at lego sets with 1 to 3 minifigures.

lego_sample |>
  na.omit() |>
  filter(minifigures <= 3) |>
  ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
  geom_bar(stat = "summary", position = "dodge")

Now we are all set!

lego_sample |>
  na.omit() |>
  filter(minifigures <= 3) |>
  ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
  geom_bar(stat = "summary", position = "dodge") +
  scale_fill_manual(values = c("darkblue","yellow","pink"))

As simple as that! Also note that you only need to type each color once. In our graph, although "darkblue", "yellow", and "pink" each appear twice, we only need to type them once instead of values = c("darkblue", "yellow", "pink", "darkblue", "yellow", "pink").

If you do not want the default colors, but are too lazy to pick distinct colors each time you draw, you can use one of the color palettes available in R.

install.packages("viridis")
library(viridis)

You can also add outlines around the bars to make them pop more.

lego_sample |>
  na.omit() |>
  filter(minifigures <= 3) |>
  ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
  geom_bar(stat = "summary", position = "dodge", color = "black") +
    scale_fill_viridis_d() 

Note that we use scale_fill_viridis_d() here. When you are using color = instead of fill =, remember to switch to scale_color_viridis_d().

lego_sample |>
  na.omit() |>
  filter(minifigures <= 3) |>
  ggplot(aes(x = size, y = price, color = as.factor(minifigures))) +
  geom_bar(stat = "summary", position = "dodge", fill = "white") +
    scale_color_viridis_d() 

scales

Let us look at some data from heroes2 dataset.

heroes2 |>
  filter(Intelligence %in% c("good", "high")) |>
  ggplot(aes(x = Gender, y = Strength)) +
  geom_bar(stat = "summary", position = position_dodge(.8), width = .7) +
  geom_errorbar(stat = "summary", position = position_dodge(0.5), width = .12) +
  facet_wrap(Intelligence ~ ., scales = "free") 

Let us take a quick glance at the chart. It looks like the good intelligence group has higher strength than the high intelligence group, as its bar is higher.

However, is it really the case?

Take a deeper look at these two bar charts. The y-axis scale is different from one graph to another.

To be sure, let us calculate their average strength between two groups.

heroes2 |>
  filter(Intelligence %in% c("good", "high")) |>
  group_by(Intelligence, Gender) |>
  summarise(mean_strength = mean(Strength))
## # A tibble: 4 × 3
## # Groups:   Intelligence [2]
##   Intelligence Gender mean_strength
##   <chr>        <chr>          <dbl>
## 1 good         F               34  
## 2 good         M               45.4
## 3 high         F               55  
## 4 high         M               57.0

We can see from the result that in the good intelligence group, the mean strength of both genders is lower than in the high intelligence group. However, if we did not pay attention to the y-axis at first, we might get the wrong impression from the graph that the good intelligence group has higher strength than the high intelligence group.

If we want to resolve this issue, we can manually set the scale by using coord_cartesian(). We can either remove scales = "free" or leave it there: since ggplot builds the plot layer by layer, the manual limits added after scales = "free" take over, so the graph is not affected.

heroes2 |>
  filter(Intelligence %in% c("good", "high")) |>
  ggplot(aes(x = Gender, y = Strength)) +
  geom_bar(stat = "summary", position = position_dodge(.8), width = .7, fill = "#BDD5EA") +
  geom_errorbar(stat = "summary", position = position_dodge(0.5), width = .12) +
  facet_wrap(Intelligence ~ ., scales = "free") +
  coord_cartesian(ylim = c(0,100)) 

After putting each graph on the same scale, it is a lot easier to compare across the panels.

You can adjust the ylim = values according to your data.

themes

Complete themes change the overall look of the graph. I commonly use theme_classic(), theme_light(), or the default theme_gray(); theme() itself is for tweaking individual elements, like the legend position we changed above.
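A minimal sketch of what switching themes looks like, reusing the score histogram from earlier:

```r
library(ggplot2)
library(openintro)

# The same score histogram with theme_classic(); swap in
# theme_light() or theme_gray() to compare the looks.
movies |>
  ggplot(aes(x = score)) +
  geom_histogram(binwidth = 10) +
  theme_classic()
```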

Efficiency Tips for Data Processing

Data Export

Conclusion and Advanced Learning

let’s draw flags!

library(ggplot2)
library(dplyr)

lith <- data.frame(name=c("A","B","C") ,
                   value=c(4,4,4))

lith %>%
  ggplot(aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = 1, fill = c("#BE3A34","#046A38","#FFB81C")) +
  coord_flip()+
  theme_void()

latv <- data.frame(name=c("A","B","C") ,
                   value=c(4,4,4))

latv %>%
  ggplot(aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = c(1.2, 0.9, 1.2), fill = c("#A4343A","#FFFFFF","#A4343A")) +
  coord_flip()+
  theme_void()

I am still struggling a bit with the width part. The flag looks okay, but I don't think this is the correct way to do it.
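One alternative I would try for stripes of unequal heights is geom_rect(), where you set each stripe's boundaries yourself. This is only a sketch, assuming the Latvian stripes are roughly in a 2:1:2 ratio:

```r
library(ggplot2)

# Each row of the data frame is one stripe, defined by its bottom
# and top edge on the y-axis (heights 2, 1, and 2).
latv2 <- data.frame(ymin = c(0, 2, 3),
                    ymax = c(2, 3, 5))

ggplot(latv2) +
  geom_rect(aes(ymin = ymin, ymax = ymax), xmin = 0, xmax = 10,
            fill = c("#A4343A", "#FFFFFF", "#A4343A")) +
  theme_void()
```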

Now let’s try some with vertical lines.

ital <- data.frame(name=c("A","B","C") ,
                   value=4)

ital %>%
  ggplot(aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = 1, fill = c("#008C45","#F4F9FF","#CD212A")) +
  theme_void()

Still struggling with the Icelandic flag. Stay tuned.

specie <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
condition <- c("a", "b", "c", "d", "d", "c", "d", "c", "d",
               "e", "a", "b", "c", "d", "c", "d", "b", "c",
               "d", "e", "a", "b", "d", "c", "d", "a", "b",
               "c", "d", "e", "a", "d", "c", "d", "e", "a",
               "b", "c", "d", "e", "d", "c", "d", "d", "e")
value <- c(6, 0.75, 1, 0.75, 3)
data <- data.frame(specie,condition,value)
ggplot(data, aes(fill=condition, y=value, x=specie)) + 
  geom_bar(position="fill", stat="identity", width = 1)+
  scale_fill_manual(values=c("blue", "white", "red","white","blue")) +
  coord_flip() +
  theme_void()